About the project

This is the start of my course diary (check the link). Updates coming soon.

Description of the course

In this course I will learn about:

  • R
  • R Markdown
  • git
  • Data analysis

Chapter 2: Analysis of survey data — test performance in a statistics course based on student variables

Introduction to the data

Let's read the data from a local folder and see how many observations and variables we have to work with.

learning2014 <- read.csv("data/learning2014.csv", row.names = "X")
dim(learning2014)
## [1] 166   7
str(learning2014)
## 'data.frame':    166 obs. of  7 variables:
##  $ gender  : Factor w/ 2 levels "F","M": 1 2 1 2 2 1 2 1 2 1 ...
##  $ age     : int  53 55 49 53 49 38 50 37 37 42 ...
##  $ attitude: num  3.7 3.1 2.5 3.5 3.7 3.8 3.5 2.9 3.8 2.1 ...
##  $ deep    : num  3.58 2.92 3.5 3.5 3.67 ...
##  $ stra    : num  3.38 2.75 3.62 3.12 3.62 ...
##  $ surf    : num  2.58 3.17 2.25 2.25 2.83 ...
##  $ points  : int  25 12 24 10 22 21 21 31 24 26 ...
summary(learning2014)
##  gender       age           attitude          deep            stra      
##  F:110   Min.   :17.00   Min.   :1.400   Min.   :1.583   Min.   :1.250  
##  M: 56   1st Qu.:21.00   1st Qu.:2.600   1st Qu.:3.333   1st Qu.:2.625  
##          Median :22.00   Median :3.200   Median :3.667   Median :3.188  
##          Mean   :25.51   Mean   :3.143   Mean   :3.680   Mean   :3.121  
##          3rd Qu.:27.00   3rd Qu.:3.700   3rd Qu.:4.083   3rd Qu.:3.625  
##          Max.   :55.00   Max.   :5.000   Max.   :4.917   Max.   :5.000  
##       surf           points     
##  Min.   :1.583   Min.   : 7.00  
##  1st Qu.:2.417   1st Qu.:19.00  
##  Median :2.833   Median :23.00  
##  Mean   :2.787   Mean   :22.72  
##  3rd Qu.:3.167   3rd Qu.:27.75  
##  Max.   :4.333   Max.   :33.00

We have 166 observations and 7 variables from a survey collected in a statistics course. There are roughly twice as many females in the data as males, and the median age is 22. Attitude, deep, stra, and surf are measured on a Likert scale. Attitude combines motivation to learn statistics with confidence in one's mathematical skills. Deep combines questions on how people evaluate information and on their willingness to understand or apply it. Stra measures strategic behaviour: how an individual works and plans. Surf, on the other hand, measures whether people are uninterested in their studies or concentrate on course requirements rather than on fully understanding the subject — that is, "surface learning". Points is the maximum overall points the student achieved, i.e. their "performance". The code blocks above show the essential information about these variables.

Graphical summarization

Let’s show a graphical overview of the data.

library(ggplot2)
library(GGally)
theme_set(theme_light(base_size=9))
ggpairs(learning2014, mapping = aes(col = gender, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)))

The graph above summarizes the distributions of all the relevant variables, coloured by gender. As one can see, men seem to have a better attitude, to be less inclined towards surface learning, and to perform slightly better. On the other hand, most females describe themselves as deeper learners and display more strategic learning patterns. Another noticeable thing is that the men are on average older than their female counterparts. The reader should note the various correlations between the variables, most notably the high correlation between exam performance and attitude and the negative correlation between points and surface learning.

Fitting a regression

In this section we try to find a regression formula that summarizes the students' performance (points) well. We start by trying out some models and finally arrive at a model that we believe is a good approximation of the variables with the biggest effect on performance, based on the statistical significance of the variables used.

model1 <- lm(points ~ deep + surf + gender, data = learning2014)
summary(model1)
## 
## Call:
## lm(formula = points ~ deep + surf + gender, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -14.8988  -3.6363   0.3912   4.2481  10.7905 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  29.8145     4.7353   6.296 2.75e-09 ***
## deep         -0.6964     0.8700  -0.800   0.4246    
## surf         -1.7461     0.9159  -1.906   0.0584 .  
## genderM       0.9827     0.9680   1.015   0.3115    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.857 on 162 degrees of freedom
## Multiple R-squared:  0.03062,    Adjusted R-squared:  0.01267 
## F-statistic: 1.706 on 3 and 162 DF,  p-value: 0.1679
# remove insignificant variables and try a different model

model2 <- lm(points ~ surf + age + stra, data = learning2014 )
summary(model2)
## 
## Call:
## lm(formula = points ~ surf + age + stra, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -15.8229  -3.5885   0.4698   4.1271   9.7896 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 26.28166    3.68767   7.127  3.2e-11 ***
## surf        -1.56432    0.87103  -1.796   0.0744 .  
## age         -0.09642    0.05885  -1.638   0.1033    
## stra         1.04286    0.59393   1.756   0.0810 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.792 on 162 degrees of freedom
## Multiple R-squared:  0.05205,    Adjusted R-squared:  0.03449 
## F-statistic: 2.965 on 3 and 162 DF,  p-value: 0.03377
# remove insignificant variables and try a different model

model3 <- lm(points ~ surf + stra + attitude, data = learning2014 )
summary(model3)
## 
## Call:
## lm(formula = points ~ surf + stra + attitude, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.1550  -3.4346   0.5156   3.6401  10.8952 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  11.0171     3.6837   2.991  0.00322 ** 
## surf         -0.5861     0.8014  -0.731  0.46563    
## stra          0.8531     0.5416   1.575  0.11716    
## attitude      3.3952     0.5741   5.913 1.93e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.296 on 162 degrees of freedom
## Multiple R-squared:  0.2074, Adjusted R-squared:  0.1927 
## F-statistic: 14.13 on 3 and 162 DF,  p-value: 3.156e-08
# remove insignificant variables and try a different model

model4 <- lm(points ~ attitude + deep + gender, data = learning2014 )
summary(model4)
## 
## Call:
## lm(formula = points ~ attitude + deep + gender, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -17.0364  -3.2315   0.3561   3.9436  11.0859 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  13.6240     3.1799   4.284 3.13e-05 ***
## attitude      3.6657     0.5984   6.125 6.61e-09 ***
## deep         -0.6172     0.7546  -0.818    0.415    
## genderM      -0.4633     0.9170  -0.505    0.614    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.337 on 162 degrees of freedom
## Multiple R-squared:  0.1953, Adjusted R-squared:  0.1804 
## F-statistic:  13.1 on 3 and 162 DF,  p-value: 1.054e-07
# remove insignificant variables and try a different model

my_model <- lm(points ~ age + attitude + stra, data= learning2014)
summary(my_model)  
## 
## Call:
## lm(formula = points ~ age + attitude + stra, data = learning2014)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -18.1149  -3.2003   0.3303   3.4129  10.7599 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 10.89543    2.64834   4.114 6.17e-05 ***
## age         -0.08822    0.05302  -1.664   0.0981 .  
## attitude     3.48077    0.56220   6.191 4.72e-09 ***
## stra         1.00371    0.53434   1.878   0.0621 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.26 on 162 degrees of freedom
## Multiple R-squared:  0.2182, Adjusted R-squared:  0.2037 
## F-statistic: 15.07 on 3 and 162 DF,  p-value: 1.07e-08

The model arrived at through this iteration is

points ~ age + attitude + stra

Interpretation of the linear regression model

We can interpret our model as follows: a one-point higher attitude (on the Likert scale) is associated with test scores that are on average about 3.5 points higher, with a standard error of roughly 0.56. Age and strategic behaviour are statistically significant only at the 10% level, so we refrain from interpreting them, since it might be mere chance that these variables show a non-zero effect on test scores. The adjusted R-squared, a goodness-of-fit measure, is 0.2037, and the F-statistic is statistically significant. This means that the fit is reasonable and that the model has statistical significance in explaining test score performance. Note that the ordinary R-squared necessarily increases as one adds variables to the model, whereas the adjusted version penalizes additional variables.
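To make this interpretation concrete, we can plug the fitted coefficients into the regression equation by hand. The student profile below (age 25, attitude 3.2, stra 3.0) is hypothetical, chosen near the sample averages purely for illustration:

```r
# coefficients taken from summary(my_model) above
b0     <- 10.89543   # intercept
b_age  <- -0.08822
b_att  <-  3.48077
b_stra <-  1.00371

# predicted exam points for a hypothetical student:
# age 25, attitude 3.2, strategic learning 3.0
pred <- b0 + b_age * 25 + b_att * 3.2 + b_stra * 3.0
pred
## [1] 22.83952
```

The prediction lands close to the sample mean of points (22.72), as one would expect for near-average predictor values.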

par(mfrow = c(2, 2))
plot(my_model, which = c(1, 2, 5))


plot(density(my_model$residuals), main = "Distribution of residuals")

In the plot of residuals against fitted values it is fairly clear that there is no apparent pattern between the two, which suggests that the size of the errors does not depend greatly on the explanatory variables used. The Q-Q plot shows a reasonable fit to the line, meaning that the residuals are roughly normally distributed; this is also demonstrated by the fourth plot, which shows the distribution of the residuals. On the leverage plot one can see a couple of observations that stand out and might therefore have a large impact on the results, but their leverage values are small, so we can still be confident in our model.

We can be fairly confident in our model's conclusion that attitude has the biggest effect on exam performance. The diagnostics above give no apparent reason to believe that the residuals are correlated with the explanatory variables, that the residuals deviate from normality, or that our results are driven by high-impact or high-leverage outliers.


Chapter 3: Logistic regression

Preliminary actions

Let’s first read the data from a local folder and take a first look at it. This file contains data on student alcohol consumption, descriptive variables, and, most importantly, the number of absences and the final grade (G3, from 0 to 20). We will analyse the relationship between alcohol consumption (the binary variable high_use) and other variables (the G3 test grade, mother's education, gender, and absences). Hypotheses: i) in previous studies, the mother's educational background has usually had good explanatory strength for the performance and behaviour of her children; ii) males tend to drink more; iii) absences capture "dropout" behaviour in the data; iv) low test performance likely increases the tendency to drink (and vice versa), because students are, for lack of a better word, demoralized or simply sad; v) freetime and goout measure reported 'boredom' or neglect of coursework, as well as the amount of socializing the students do.

library(dplyr)
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
# testin_k_arvo / testin_hajonta: Finnish for "test mean" / "test sd"
summarise(group_by(alc, high_use), testin_k_arvo = mean(G3), testin_hajonta = sd(G3))
## # A tibble: 2 × 3
##   high_use testin_k_arvo testin_hajonta
##      <lgl>         <dbl>          <dbl>
## 1    FALSE      11.73507       3.385467
## 2     TRUE      10.80702       3.042115
# tarkasteltavat_muuttujat: Finnish for "variables under consideration"
tarkasteltavat_muuttujat <- c("high_use","G3","Medu","sex","absences","freetime","goout")
data <- select(alc, one_of(tarkasteltavat_muuttujat))
glimpse(data)
## Observations: 382
## Variables: 7
## $ high_use <lgl> FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE...
## $ G3       <int> 8, 8, 11, 14, 12, 14, 12, 10, 18, 14, 12, 12, 13, 12,...
## $ Medu     <int> 4, 1, 1, 4, 3, 4, 2, 4, 3, 3, 4, 2, 4, 4, 2, 4, 4, 3,...
## $ sex      <fctr> F, F, F, F, F, M, M, F, M, M, F, F, M, M, M, F, F, F...
## $ absences <int> 5, 3, 8, 1, 2, 8, 0, 4, 0, 0, 1, 2, 1, 1, 0, 5, 8, 3,...
## $ freetime <int> 3, 3, 3, 2, 3, 4, 4, 1, 2, 5, 3, 2, 3, 4, 5, 4, 2, 3,...
## $ goout    <int> 4, 3, 2, 2, 2, 2, 4, 4, 2, 1, 3, 2, 3, 3, 2, 4, 3, 2,...
library(ggplot2)
library(tidyr)
gather(data) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") +geom_bar()
## Warning: attributes are not identical across measure variables; they will
## be dropped

Next we check the frequencies of high usage among the sexes across several variables: grades (G3), absences, mother's education, freetime, and going out. We also draw boxplots to check the distributions within these subgroups.

library(dplyr)
data %>% group_by(sex,high_use) %>% summarise(count = n(),mean_grade = mean(G3))
## Source: local data frame [4 x 4]
## Groups: sex [?]
## 
##      sex high_use count mean_grade
##   <fctr>    <lgl> <int>      <dbl>
## 1      F    FALSE   156   11.39744
## 2      F     TRUE    42   11.71429
## 3      M    FALSE   112   12.20536
## 4      M     TRUE    72   10.27778
ggplot(alc, aes(x = high_use, y = G3, col=sex)) + geom_boxplot() + ylab("grade") + ggtitle("High Alcohol Usage and Test Performance")

data %>% group_by(sex,high_use) %>% summarise(count = n(), mean_absences = mean(absences))
## Source: local data frame [4 x 4]
## Groups: sex [?]
## 
##      sex high_use count mean_absences
##   <fctr>    <lgl> <int>         <dbl>
## 1      F    FALSE   156      4.224359
## 2      F     TRUE    42      6.785714
## 3      M    FALSE   112      2.982143
## 4      M     TRUE    72      6.125000
ggplot(alc, aes(x = high_use, y = absences, col=sex)) + geom_boxplot() + ylab("absences") + ggtitle("High Alcohol Usage and absences")

data %>% group_by(sex,high_use) %>% summarise(count = n(), mean_mothersEdu = mean(Medu))
## Source: local data frame [4 x 4]
## Groups: sex [?]
## 
##      sex high_use count mean_mothersEdu
##   <fctr>    <lgl> <int>           <dbl>
## 1      F    FALSE   156        2.673077
## 2      F     TRUE    42        2.785714
## 3      M    FALSE   112        2.973214
## 4      M     TRUE    72        2.847222
ggplot(alc, aes(x = high_use, y = Medu, col=sex)) + geom_boxplot() + ylab("Mothers Education") + ggtitle("High Alcohol Usage and Mothers Education")

data %>% group_by(sex,high_use) %>% summarise(count = n(), mean_freetime = mean(freetime))
## Source: local data frame [4 x 4]
## Groups: sex [?]
## 
##      sex high_use count mean_freetime
##   <fctr>    <lgl> <int>         <dbl>
## 1      F    FALSE   156      2.929487
## 2      F     TRUE    42      3.357143
## 3      M    FALSE   112      3.392857
## 4      M     TRUE    72      3.500000
ggplot(alc, aes(x = high_use, y = freetime, col=sex)) + geom_boxplot() + ylab("Freetime") + ggtitle("High Alcohol Usage and freetime")

data %>% group_by(sex,high_use) %>% summarise(count = n(), mean_goout = mean(goout))
## Source: local data frame [4 x 4]
## Groups: sex [?]
## 
##      sex high_use count mean_goout
##   <fctr>    <lgl> <int>      <dbl>
## 1      F    FALSE   156   2.955128
## 2      F     TRUE    42   3.357143
## 3      M    FALSE   112   2.714286
## 4      M     TRUE    72   3.930556
ggplot(alc, aes(x = high_use, y = goout, col=sex)) + geom_boxplot() + ylab("Spend time with friends") + ggtitle("High Alcohol Usage and Going out")

It appears that test performance is lower for high users of alcohol, especially among men. A similar effect can be found for absences: high users of alcohol are more likely to miss classes, again especially the men. We did not find a clear difference based on mother's education (our initial hypothesis seems to have been wrong), and therefore turn our attention to freetime. High users seem to have more freetime, which might be due to neglect of work or plain boredom. Similarly, individuals who are more social, or have less preference for studying, tend to be heavier drinkers.

The model

Next we fit a logistic model to explain high usage of alcohol. Our model is the following

high_use ~ freetime + absences + sex + goout

# find the model with glm()
m <- glm(high_use ~ freetime + absences + sex + goout, data = alc, family = "binomial")

# print out a summary of the model
summary(m)
## 
## Call:
## glm(formula = high_use ~ freetime + absences + sex + goout, family = "binomial", 
##     data = alc)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.7938  -0.8060  -0.5306   0.8297   2.4887  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -4.33104    0.57522  -7.529 5.10e-14 ***
## freetime     0.07254    0.13705   0.529 0.596610    
## absences     0.08487    0.02240   3.789 0.000151 ***
## sexM         0.93806    0.25762   3.641 0.000271 ***
## goout        0.71094    0.12444   5.713 1.11e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 465.68  on 381  degrees of freedom
## Residual deviance: 387.47  on 377  degrees of freedom
## AIC: 397.47
## 
## Number of Fisher Scoring iterations: 4
# print out the coefficients of the model
coef(m)
## (Intercept)    freetime    absences        sexM       goout 
## -4.33104146  0.07253827  0.08487446  0.93806336  0.71094375
# odds ratio: the exponent of a coefficient gives the multiplicative change
# in the odds for a one-unit increase in the predictor:
# exp(beta) = odds(Y|x+1) / odds(Y|x)

# compute odds ratios (OR)
OR <- coef(m) %>% exp

# compute confidence intervals (CI)
CI <- confint(m) %>% exp
## Waiting for profiling to be done...
# print out the odds ratios with their confidence intervals
cbind(OR, CI)
##                     OR      2.5 %     97.5 %
## (Intercept) 0.01315384 0.00406496 0.03896248
## freetime    1.07523395 0.82191021 1.40863837
## absences    1.08858040 1.04303580 1.14007640
## sexM        2.55502845 1.55086699 4.26706152
## goout       2.03591174 1.60487130 2.61688539

When goout (spending time with friends) is included in the model, freetime appears to have no explanatory power. From the printout above, one can see that the number of absences, gender, and social activity seem to explain alcohol usage quite well. For example, individuals reporting one point more time spent with friends (on the Likert scale) are about twice as likely to use high amounts of alcohol, everything else held constant. Likewise, males are much more likely to consume a lot of alcohol, and absences carry a small risk factor. Note that freetime is not significant at the 10% (or 5%) level and its odds-ratio confidence interval includes 1, which means the model gives no clear evidence on whether students with more freetime have a higher or lower risk of consuming high amounts of alcohol. The results show that the initial hypotheses on social behaviour, gender, and the number of absences seem to be correct, but those on mother's education (based on the graph above) and freetime were not.
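As a sanity check on this interpretation, we can recompute the goout odds ratio from its coefficient and turn the model's log-odds into a probability. The student profile (male, freetime 3, absences 4, goout 3) is hypothetical, chosen only for illustration:

```r
# coefficients taken from summary(m) above
b <- c(intercept = -4.33104, freetime = 0.07254,
       absences  =  0.08487, sexM = 0.93806, goout = 0.71094)

# odds ratio for goout: one extra point multiplies the odds of high use by ~2
exp(b["goout"])   # ~2.036, matching the OR table above

# log-odds and probability of high_use for a hypothetical student:
# male, freetime = 3, absences = 4, goout = 3
log_odds <- unname(b["intercept"] + b["freetime"] * 3 + b["absences"] * 4 +
  b["sexM"] * 1 + b["goout"] * 3)
prob <- 1 / (1 + exp(-log_odds))   # ~0.33
```

A probability of roughly 0.33 for this fairly average profile is in line with the overall share of high users in the data (about 0.30, visible in the prediction table below).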

Validation

Now we make predictions and run a cross-validation based on the model above.

# find the model with glm()
m <- glm(high_use ~ freetime + absences + sex + goout, data = alc, family = "binomial")

# predict() the probability of high_use
probabilities <- predict(m, type = "response")

# add the predicted probabilities to 'alc'
alc <- mutate(alc, probability = probabilities)

# use the probabilities to make a prediction of high_use
alc <- mutate(alc, prediction = probability > 0.5)

# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction)
##         prediction
## high_use FALSE TRUE
##    FALSE   250   18
##    TRUE     64   50
# initialize a plot of 'high_use' versus 'probability' in 'alc'
g <- ggplot(alc, aes(x = probability, y = high_use, col=prediction))

# define the geom as points and draw the plot
g + geom_point()

# tabulate the target variable versus the predictions
table(high_use = alc$high_use, prediction = alc$prediction) %>% prop.table %>% addmargins()
##         prediction
## high_use      FALSE       TRUE        Sum
##    FALSE 0.65445026 0.04712042 0.70157068
##    TRUE  0.16753927 0.13089005 0.29842932
##    Sum   0.82198953 0.17801047 1.00000000
# define a loss function (mean prediction error)
loss_func <- function(class, prob) {
  n_wrong <- abs(class - prob) > 0.5
  mean(n_wrong)
}

# call loss_func to compute the average number of wrong predictions in the (training) data
loss_func(class = alc$high_use, prob = alc$probability)
## [1] 0.2146597
# K-fold cross-validation
library(boot)
cv <- cv.glm(data = alc, cost = loss_func, glmfit = m, K = 10)

# average number of wrong predictions in the cross validation
cv$delta[1]
## [1] 0.2225131

Based on the definition of high users, our model predicts that 64 individuals are not high users when in fact they are, that 18 are high users when they are not, and that 50 are correctly identified as high users. This is also shown in the graph of predicted probabilities against the actual values. The table shows the proportions with which the model is correct or incorrect (type I and type II error rates can easily be calculated from it). Noteworthy is that the loss function, which is also the training error, is lower (0.21) than the baseline (0.26), which suggests that our model works better than the baseline model using failures, absences and sex as explanatory variables.
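Both the training error and the error of a trivial always-predict-FALSE strategy can be recomputed directly from the confusion table, using the counts above:

```r
# confusion table from above: rows = actual high_use, cols = prediction
confusion <- matrix(c(250, 64, 18, 50), nrow = 2,
                    dimnames = list(high_use = c("FALSE", "TRUE"),
                                    prediction = c("FALSE", "TRUE")))
n <- sum(confusion)   # 382 students

# training error of the model: share of off-diagonal (misclassified) cases
train_err <- (confusion["FALSE", "TRUE"] + confusion["TRUE", "FALSE"]) / n
train_err     # 82/382 = 0.2146597, matching loss_func above

# always predicting FALSE misclassifies exactly the actual high users
trivial_err <- sum(confusion["TRUE", ]) / n
trivial_err   # 114/382 = 0.2984293
```

So the model beats even the crude always-FALSE guess, and the 0.26 baseline model mentioned above sits between the two.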


Chapter 4: Clustering

Preliminaries

Here we study the Boston data set. There are 506 observations and 14 variables. Basic information about the variables: crim is the per capita crime rate by town, dis is the weighted mean of distances to five Boston employment centres, ptratio is the pupil-teacher ratio by town, and medv is the median value of owner-occupied homes in $1000s. As we can see from the summary, crime is highly skewed with outliers, whereas ptratio and medv are distributed fairly evenly.

library(MASS)
## 
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
## 
##     select
library(dplyr)
library(tidyr)
library(ggplot2)
library(corrplot)
data("Boston")
dim(Boston)
## [1] 506  14
str(Boston)
## 'data.frame':    506 obs. of  14 variables:
##  $ crim   : num  0.00632 0.02731 0.02729 0.03237 0.06905 ...
##  $ zn     : num  18 0 0 0 0 0 12.5 12.5 12.5 12.5 ...
##  $ indus  : num  2.31 7.07 7.07 2.18 2.18 2.18 7.87 7.87 7.87 7.87 ...
##  $ chas   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ nox    : num  0.538 0.469 0.469 0.458 0.458 0.458 0.524 0.524 0.524 0.524 ...
##  $ rm     : num  6.58 6.42 7.18 7 7.15 ...
##  $ age    : num  65.2 78.9 61.1 45.8 54.2 58.7 66.6 96.1 100 85.9 ...
##  $ dis    : num  4.09 4.97 4.97 6.06 6.06 ...
##  $ rad    : int  1 2 2 3 3 3 5 5 5 5 ...
##  $ tax    : num  296 242 242 222 222 222 311 311 311 311 ...
##  $ ptratio: num  15.3 17.8 17.8 18.7 18.7 18.7 15.2 15.2 15.2 15.2 ...
##  $ black  : num  397 397 393 395 397 ...
##  $ lstat  : num  4.98 9.14 4.03 2.94 5.33 ...
##  $ medv   : num  24 21.6 34.7 33.4 36.2 28.7 22.9 27.1 16.5 18.9 ...
glimpse(Boston)
## Observations: 506
## Variables: 14
## $ crim    <dbl> 0.00632, 0.02731, 0.02729, 0.03237, 0.06905, 0.02985, ...
## $ zn      <dbl> 18.0, 0.0, 0.0, 0.0, 0.0, 0.0, 12.5, 12.5, 12.5, 12.5,...
## $ indus   <dbl> 2.31, 7.07, 7.07, 2.18, 2.18, 2.18, 7.87, 7.87, 7.87, ...
## $ chas    <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...
## $ nox     <dbl> 0.538, 0.469, 0.469, 0.458, 0.458, 0.458, 0.524, 0.524...
## $ rm      <dbl> 6.575, 6.421, 7.185, 6.998, 7.147, 6.430, 6.012, 6.172...
## $ age     <dbl> 65.2, 78.9, 61.1, 45.8, 54.2, 58.7, 66.6, 96.1, 100.0,...
## $ dis     <dbl> 4.0900, 4.9671, 4.9671, 6.0622, 6.0622, 6.0622, 5.5605...
## $ rad     <int> 1, 2, 2, 3, 3, 3, 5, 5, 5, 5, 5, 5, 5, 4, 4, 4, 4, 4, ...
## $ tax     <dbl> 296, 242, 242, 222, 222, 222, 311, 311, 311, 311, 311,...
## $ ptratio <dbl> 15.3, 17.8, 17.8, 18.7, 18.7, 18.7, 15.2, 15.2, 15.2, ...
## $ black   <dbl> 396.90, 396.90, 392.83, 394.63, 396.90, 394.12, 395.60...
## $ lstat   <dbl> 4.98, 9.14, 4.03, 2.94, 5.33, 5.21, 12.43, 19.15, 29.9...
## $ medv    <dbl> 24.0, 21.6, 34.7, 33.4, 36.2, 28.7, 22.9, 27.1, 16.5, ...
summary(Boston)
##       crim                zn             indus            chas        
##  Min.   : 0.00632   Min.   :  0.00   Min.   : 0.46   Min.   :0.00000  
##  1st Qu.: 0.08204   1st Qu.:  0.00   1st Qu.: 5.19   1st Qu.:0.00000  
##  Median : 0.25651   Median :  0.00   Median : 9.69   Median :0.00000  
##  Mean   : 3.61352   Mean   : 11.36   Mean   :11.14   Mean   :0.06917  
##  3rd Qu.: 3.67708   3rd Qu.: 12.50   3rd Qu.:18.10   3rd Qu.:0.00000  
##  Max.   :88.97620   Max.   :100.00   Max.   :27.74   Max.   :1.00000  
##       nox               rm             age              dis        
##  Min.   :0.3850   Min.   :3.561   Min.   :  2.90   Min.   : 1.130  
##  1st Qu.:0.4490   1st Qu.:5.886   1st Qu.: 45.02   1st Qu.: 2.100  
##  Median :0.5380   Median :6.208   Median : 77.50   Median : 3.207  
##  Mean   :0.5547   Mean   :6.285   Mean   : 68.57   Mean   : 3.795  
##  3rd Qu.:0.6240   3rd Qu.:6.623   3rd Qu.: 94.08   3rd Qu.: 5.188  
##  Max.   :0.8710   Max.   :8.780   Max.   :100.00   Max.   :12.127  
##       rad              tax           ptratio          black       
##  Min.   : 1.000   Min.   :187.0   Min.   :12.60   Min.   :  0.32  
##  1st Qu.: 4.000   1st Qu.:279.0   1st Qu.:17.40   1st Qu.:375.38  
##  Median : 5.000   Median :330.0   Median :19.05   Median :391.44  
##  Mean   : 9.549   Mean   :408.2   Mean   :18.46   Mean   :356.67  
##  3rd Qu.:24.000   3rd Qu.:666.0   3rd Qu.:20.20   3rd Qu.:396.23  
##  Max.   :24.000   Max.   :711.0   Max.   :22.00   Max.   :396.90  
##      lstat            medv      
##  Min.   : 1.73   Min.   : 5.00  
##  1st Qu.: 6.95   1st Qu.:17.02  
##  Median :11.36   Median :21.20  
##  Mean   :12.65   Mean   :22.53  
##  3rd Qu.:16.95   3rd Qu.:25.00  
##  Max.   :37.97   Max.   :50.00
pairs(Boston)

plot(density(Boston$crim))

plot(density(Boston$ptratio))

plot(density(Boston$medv))
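The skewness visible in the density plots can be quantified with a quick mean-versus-median comparison; the values match the summary above:

```r
library(MASS)
data("Boston")

# crim: a mean far above the median indicates a long right tail
mean(Boston$crim)     # ~3.61
median(Boston$crim)   # ~0.26

# ptratio is far more symmetric: mean and median nearly coincide
mean(Boston$ptratio)    # ~18.46
median(Boston$ptratio)  # ~19.05
```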

Correlations, factoring, and dividing the data into training and test sets

#calculate the correlations and plot them
cor_matrix <- cor(Boston)
cor_matrix %>% round(digits=2) %>% corrplot.mixed(tl.cex = 0.7,number.cex=0.7,order="hclust",addrect=2)

From the correlation plot we can see the relevant correlations between the variables. As one could expect, industrial areas are related to nitrogen oxide concentrations and are negatively correlated with residential land share and with distance to employment centres. Crime is correlated, among other things, with accessibility to radial highways and with industrial areas.
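These observations can be checked numerically against the correlation matrix; the pairs below are chosen to match the claims in the text:

```r
library(MASS)
data("Boston")

# industrial share vs. nitrogen oxide concentration: strongly positive
cor(Boston$indus, Boston$nox)

# industrial share vs. distance to employment centres: strongly negative
cor(Boston$indus, Boston$dis)

# crime rate vs. accessibility to radial highways: clearly positive
cor(Boston$crim, Boston$rad)
```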

Next we standardize the data set; the distributions are easier to compare with standardized variables. We then cut the scaled crime variable into a factor with four levels, from "low" to "high", using its quantiles as break points. Finally, we divide the data into a training set (80%) and a test set (20%).

boston_scaled <- scale(Boston)
summary(boston_scaled) 
##       crim                 zn               indus        
##  Min.   :-0.419367   Min.   :-0.48724   Min.   :-1.5563  
##  1st Qu.:-0.410563   1st Qu.:-0.48724   1st Qu.:-0.8668  
##  Median :-0.390280   Median :-0.48724   Median :-0.2109  
##  Mean   : 0.000000   Mean   : 0.00000   Mean   : 0.0000  
##  3rd Qu.: 0.007389   3rd Qu.: 0.04872   3rd Qu.: 1.0150  
##  Max.   : 9.924110   Max.   : 3.80047   Max.   : 2.4202  
##       chas              nox                rm               age         
##  Min.   :-0.2723   Min.   :-1.4644   Min.   :-3.8764   Min.   :-2.3331  
##  1st Qu.:-0.2723   1st Qu.:-0.9121   1st Qu.:-0.5681   1st Qu.:-0.8366  
##  Median :-0.2723   Median :-0.1441   Median :-0.1084   Median : 0.3171  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.:-0.2723   3rd Qu.: 0.5981   3rd Qu.: 0.4823   3rd Qu.: 0.9059  
##  Max.   : 3.6648   Max.   : 2.7296   Max.   : 3.5515   Max.   : 1.1164  
##       dis               rad               tax             ptratio       
##  Min.   :-1.2658   Min.   :-0.9819   Min.   :-1.3127   Min.   :-2.7047  
##  1st Qu.:-0.8049   1st Qu.:-0.6373   1st Qu.:-0.7668   1st Qu.:-0.4876  
##  Median :-0.2790   Median :-0.5225   Median :-0.4642   Median : 0.2746  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.6617   3rd Qu.: 1.6596   3rd Qu.: 1.5294   3rd Qu.: 0.8058  
##  Max.   : 3.9566   Max.   : 1.6596   Max.   : 1.7964   Max.   : 1.6372  
##      black             lstat              medv        
##  Min.   :-3.9033   Min.   :-1.5296   Min.   :-1.9063  
##  1st Qu.: 0.2049   1st Qu.:-0.7986   1st Qu.:-0.5989  
##  Median : 0.3808   Median :-0.1811   Median :-0.1449  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.4332   3rd Qu.: 0.6024   3rd Qu.: 0.2683  
##  Max.   : 0.4406   Max.   : 3.5453   Max.   : 2.9865
boston_scaled <- as.data.frame(boston_scaled)
scaled_crim <- boston_scaled$crim
summary(scaled_crim)
##      Min.   1st Qu.    Median      Mean   3rd Qu.      Max. 
## -0.419400 -0.410600 -0.390300  0.000000  0.007389  9.924000
bins <- quantile(scaled_crim)
bins
##           0%          25%          50%          75%         100% 
## -0.419366929 -0.410563278 -0.390280295  0.007389247  9.924109610
crime <- cut(scaled_crim, breaks= bins, include.lowest = TRUE, labels = c("low", "med_low", "med_high", "high"))
crime
##   [1] low      low      low      low      low      low      med_low 
##   [8] med_low  med_low  med_low  med_low  med_low  med_low  med_high
##  [15] med_high med_high med_high med_high med_high med_high med_high
## [output truncated: the remaining 506 crime category labels omitted]
## Levels: low med_low med_high high
# drop the original continuous crim and add the categorical crime variable
boston_scaled <- dplyr::select(boston_scaled, -crim)
boston_scaled <- data.frame(boston_scaled, crime)

n <- nrow(boston_scaled)
n
## [1] 506
# randomly sample 80% of the row indices for the training set
index <- sample(n, size = n * 0.8)
index
## [output omitted: the 404 sampled row indices]
trainingset <- boston_scaled[index,]
testset <- boston_scaled[-index,]

# save the correct classes, then remove the class variable from the test set
correct_classes <- testset$crime
testset <- dplyr::select(testset, -crime)

Linear Discriminant Analysis

Next we fit an LDA model, trying to explain the crime rate with the other variables in the data. The first linear discriminant does a good job of dividing the data: there are two main groups, and the high-crime areas are vastly different from the rest. High crime is closely related to proximity to radial highways (variable rad), which seems to be the main dividing factor. From the test-set prediction table we can see that predictions of high crime are accurate, but there is some dispersion among the med_high to low classes. Overall the model works quite adequately.

lda.fit <- lda(crime ~ ., data = trainingset)
lda.fit
## Call:
## lda(crime ~ ., data = trainingset)
## 
## Prior probabilities of groups:
##       low   med_low  med_high      high 
## 0.2549505 0.2475248 0.2376238 0.2599010 
## 
## Group means:
##                   zn      indus        chas        nox         rm
## low       1.00971304 -0.9420787 -0.15765625 -0.8964669  0.4754299
## med_low  -0.06918811 -0.2661348 -0.07547406 -0.5737607 -0.1368375
## med_high -0.37915407  0.2015343  0.17879700  0.4630668  0.1476340
## high     -0.48724019  1.0170492 -0.04735191  1.0582604 -0.4171019
##                 age        dis        rad        tax    ptratio
## low      -0.8537188  0.8956460 -0.6919669 -0.7403760 -0.5099792
## med_low  -0.2954614  0.3412147 -0.5397114 -0.4433276 -0.0796942
## med_high  0.4145037 -0.4163965 -0.4004598 -0.3061125 -0.4586876
## high      0.8092809 -0.8488686  1.6388211  1.5145512  0.7815834
##                black       lstat       medv
## low       0.37546905 -0.78695096  0.5456772
## med_low   0.30499162 -0.12347789 -0.0140051
## med_high  0.05080184 -0.04645095  0.2202353
## high     -0.73977387  0.89707666 -0.6732395
## 
## Coefficients of linear discriminants:
##                 LD1         LD2         LD3
## zn       0.07429930  0.64759747 -0.86406772
## indus    0.04473036 -0.26308873  0.33263942
## chas    -0.09549653 -0.04353417  0.05661754
## nox      0.38740700 -0.86329140 -1.31144646
## rm      -0.10782652 -0.09198197 -0.17276358
## age      0.23322005 -0.17118015 -0.04374182
## dis     -0.05804157 -0.18852476  0.11765028
## rad      3.15505806  0.83656790 -0.27161047
## tax      0.03229992  0.10376562  0.58040175
## ptratio  0.10469874  0.08146420 -0.15510958
## black   -0.10264241  0.03780032  0.13670777
## lstat    0.19947897 -0.15757292  0.42849545
## medv     0.18994864 -0.37011661 -0.08298239
## 
## Proportion of trace:
##    LD1    LD2    LD3 
## 0.9481 0.0382 0.0138
lda.arrows <- function(x, myscale = 1, arrow_heads = 0.1, color = "red", tex = 0.75, choices = c(1,2)){
  heads <- coef(x)
  arrows(x0 = 0, y0 = 0, 
         x1 = myscale * heads[,choices[1]], 
         y1 = myscale * heads[,choices[2]], col=color, length = arrow_heads)
  text(myscale * heads[,choices], labels = row.names(heads), 
       cex = tex, col=color, pos=3)
}

classes <- as.numeric(trainingset$crime)

plot(lda.fit, dimen =2, col = classes, pch = classes)
lda.arrows(lda.fit, myscale = 1)

lda.pred <- predict(lda.fit, newdata = testset)

table(correct = correct_classes, predicted = lda.pred$class)
##           predicted
## correct    low med_low med_high high
##   low       14       7        3    0
##   med_low    3      21        2    0
##   med_high   1      15       13    1
##   high       0       0        0   22
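As a quick sanity check on the claim above, the overall accuracy can be computed from the confusion table (a minimal sketch; the counts below are copied from the printout, not recomputed):

```r
# counts copied from the LDA confusion table above (rows = correct, cols = predicted)
conf_mat <- matrix(c(14,  7,  3,  0,
                      3, 21,  2,  0,
                      1, 15, 13,  1,
                      0,  0,  0, 22),
                   nrow = 4, byrow = TRUE,
                   dimnames = list(correct   = c("low", "med_low", "med_high", "high"),
                                   predicted = c("low", "med_low", "med_high", "high")))

# overall accuracy: correctly classified (diagonal) over all test observations
sum(diag(conf_mat)) / sum(conf_mat)
## [1] 0.6862745
```

About 69% of the test-set observations are classified correctly, with the high class classified perfectly.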

K-means clustering

Next we compute the distance matrix used to cluster the Boston data. As the pairs plot and the ggpairs plot (the latter is a bit messy) show, two centers seem to work quite well for this data. The number of centers is most appropriate at the point where the total within-cluster sum of squares stops declining rapidly, i.e. at the "elbow" of the curve.

library(MASS)
data('Boston')
dist_euclidian <- dist(Boston)
summary(dist_euclidian)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.119  85.620 170.500 226.300 371.900 626.000
k_max <- 10
twcss <- sapply(1:k_max, function(k){kmeans(dist_euclidian, k)$tot.withinss})

plot(1:k_max, twcss, type='b')
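The elbow can also be picked programmatically. Here is a small sketch (not part of the original analysis; note that kmeans uses random initialization, so I fix a seed and use several starts, which the code above does not):

```r
library(MASS)

data("Boston")
set.seed(123)  # k-means starts from random centers, so fix the seed for reproducibility
d <- dist(Boston)  # same Euclidean distance matrix as above

# total within-cluster sum of squares for k = 1..10 (nstart stabilizes each fit)
twcss <- sapply(1:10, function(k) kmeans(d, k, nstart = 10)$tot.withinss)

# the elbow: the number of centers reached by the single steepest drop in WCSS
k_opt <- which.max(-diff(twcss)) + 1
k_opt
```

For this data the steepest drop comes right at the start of the curve, which again supports using two centers.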

km <- kmeans(dist_euclidian, centers = 2)
# color and shape the points by cluster assignment
pairs(Boston, col = km$cluster, pch = km$cluster)

library(GGally)
data.frame(km$cluster)[,1]
## [output omitted: the 506 cluster assignments (1 or 2)]
Boston2 <- mutate(Boston, cluster = as.factor(km$cluster))
ggpairs(Boston2, mapping = aes(col = cluster, alpha = 0.3), lower = list(combo = wrap("facethist", bins = 20)), title ="K-means (2) colored plots")
## Warning in cor(x, y, method = method, use = use): the standard deviation is
## zero



Chapter 5: Dimensionality reduction techniques

Preliminaries

Let’s first load the data we produced with the create_human.R script. The data is a collection of gross national income, expected education, life expectancy, education and labor shares between genders, maternal mortality rate, adolescent birth rate, and female political participation rate. In other words, the data describes human development across countries.

We can see from the first plot below that GNI, maternal mortality rate, and adolescent birth rate are skewed with long right tails. Education ratio, labor ratio, and life expectancy show high kurtosis and left tails. Expected education is distributed fairly close to a normal distribution.

From the second plot we can see that maternal mortality rate and adolescent birth rate are positively correlated with each other and negatively correlated with high-development variables such as education ratio, GNI, and life expectancy.

human <- read.csv(file = "data/human.csv",row.names = "X")
head(human)
##               Edu2.FM   Labo.FM Life.Exp Edu.Exp   GNI Mat.Mor Ado.Birth
## Norway      1.0072389 0.8908297     81.6    17.5 64992       4       7.8
## Australia   0.9968288 0.8189415     82.4    20.2 42261       6      12.1
## Switzerland 0.9834369 0.8251001     83.0    15.8 56431       6       1.9
## Denmark     0.9886128 0.8840361     80.2    18.7 44025       5       5.1
## Netherlands 0.9690608 0.8286119     81.6    17.9 45435       6       6.2
## Germany     0.9927835 0.8072289     80.9    16.5 43919       7       3.8
##             Parli.F
## Norway         39.6
## Australia      30.5
## Switzerland    28.5
## Denmark        38.0
## Netherlands    36.9
## Germany        36.9
summary(human)
##     Edu2.FM          Labo.FM          Life.Exp        Edu.Exp     
##  Min.   :0.1717   Min.   :0.1857   Min.   :49.00   Min.   : 5.40  
##  1st Qu.:0.7264   1st Qu.:0.5984   1st Qu.:66.30   1st Qu.:11.25  
##  Median :0.9375   Median :0.7535   Median :74.20   Median :13.50  
##  Mean   :0.8529   Mean   :0.7074   Mean   :71.65   Mean   :13.18  
##  3rd Qu.:0.9968   3rd Qu.:0.8535   3rd Qu.:77.25   3rd Qu.:15.20  
##  Max.   :1.4967   Max.   :1.0380   Max.   :83.50   Max.   :20.20  
##       GNI            Mat.Mor         Ado.Birth         Parli.F     
##  Min.   :   581   Min.   :   1.0   Min.   :  0.60   Min.   : 0.00  
##  1st Qu.:  4198   1st Qu.:  11.5   1st Qu.: 12.65   1st Qu.:12.40  
##  Median : 12040   Median :  49.0   Median : 33.60   Median :19.30  
##  Mean   : 17628   Mean   : 149.1   Mean   : 47.16   Mean   :20.91  
##  3rd Qu.: 24512   3rd Qu.: 190.0   3rd Qu.: 71.95   3rd Qu.:27.95  
##  Max.   :123124   Max.   :1100.0   Max.   :204.80   Max.   :57.50
dim(human)
## [1] 155   8
str(human)
## 'data.frame':    155 obs. of  8 variables:
##  $ Edu2.FM  : num  1.007 0.997 0.983 0.989 0.969 ...
##  $ Labo.FM  : num  0.891 0.819 0.825 0.884 0.829 ...
##  $ Life.Exp : num  81.6 82.4 83 80.2 81.6 80.9 80.9 79.1 82 81.8 ...
##  $ Edu.Exp  : num  17.5 20.2 15.8 18.7 17.9 16.5 18.6 16.5 15.9 19.2 ...
##  $ GNI      : int  64992 42261 56431 44025 45435 43919 39568 52947 42155 32689 ...
##  $ Mat.Mor  : int  4 6 6 5 6 7 9 28 11 8 ...
##  $ Ado.Birth: num  7.8 12.1 1.9 5.1 6.2 3.8 8.2 31 14.5 25.3 ...
##  $ Parli.F  : num  39.6 30.5 28.5 38 36.9 36.9 19.9 19.4 28.2 31.4 ...
library(GGally)
library(corrplot)
library(tidyr)
library(dplyr)

# visualize the 'human_' variables
ggpairs(human)

# compute the correlation matrix and visualize it with corrplot
cor(human) %>% round(digits=2) %>% corrplot.mixed(tl.cex = 0.7,number.cex=0.7,order="hclust",addrect=2)

Principal component analysis (PCA)

Let’s perform a PCA on the data. As we can see, without standardization the plot is quite a mess.

# perform principal component analysis (with the SVD method)
pca_human_nstd <- prcomp(human)

# rounded percentages of variance captured by each PC
pca_pr <- round(summary(pca_human_nstd)$importance[2, ], digits = 5) * 100

# print out the percentages of variance
pca_pr
##   PC1   PC2   PC3   PC4   PC5   PC6   PC7   PC8 
## 99.99  0.01  0.00  0.00  0.00  0.00  0.00  0.00
# create object pc_lab to be used as axis labels
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")

# draw a biplot of the principal component representation and the original variables
biplot(pca_human_nstd, choices = 1:2, cex = c(0.4, 0.4), col = c("grey60","navyblue"), xlab = pc_lab[1], ylab = pc_lab[2])
## Warning in arrows(0, 0, y[, 1L] * 0.8, y[, 2L] * 0.8, col = col[2L], length
## = arrow.len): zero-length arrow is of indeterminate angle and so skipped


Let’s now standardize the data and interpret the results. Without standardization GNI appears orthogonal to all the other variables, because its variance is by far the largest. With standardization we can see more clearly which variables are correlated and which contribute to principal components 1 and 2. The correlations we interpreted earlier still hold. Since female parliamentary participation and the labor ratio point upward in the biplot, they are roughly orthogonal to, i.e. do not correlate with, the other variables in the data.

This result seems plausible: if the child mortality rate in a country is high, or mothers die young, this directly affects life expectancy, and indirectly the expected education rate and therefore gross national income. One could go as far as to say that the contribution of mothers, or more generally of women, is an important factor in development (PC1). The other part contributing to welfare is the group of highly correlated variables such as GNI, life expectancy, and education, and furthermore the education of women.
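The effect of standardization is easy to demonstrate on simulated data (a toy example with made-up variable names, not the human data): when one variable's variance dwarfs the rest, unscaled PCA loads the first component almost entirely on that variable.

```r
set.seed(1)
toy <- data.frame(gni  = rnorm(100, sd = 100),  # large-variance variable, like GNI
                  life = rnorm(100, sd = 1),
                  edu  = rnorm(100, sd = 1))

# unscaled PCA: PC1 is essentially just the high-variance column
pca_raw <- prcomp(toy)
abs(pca_raw$rotation["gni", "PC1"])  # very close to 1

# standardized PCA spreads the loadings across the variables
pca_std <- prcomp(toy, scale. = TRUE)
summary(pca_std)$importance[2, ]     # variance now shared across components
```

This is exactly why the unstandardized biplot above assigns 99.99% of the variance to PC1.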

# standardize the variables
human_std <- scale(human)

# print out summaries of the standardized variables
summary(human_std)
##     Edu2.FM           Labo.FM           Life.Exp          Edu.Exp       
##  Min.   :-2.8189   Min.   :-2.6247   Min.   :-2.7188   Min.   :-2.7378  
##  1st Qu.:-0.5233   1st Qu.:-0.5484   1st Qu.:-0.6425   1st Qu.:-0.6782  
##  Median : 0.3503   Median : 0.2316   Median : 0.3056   Median : 0.1140  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.5958   3rd Qu.: 0.7350   3rd Qu.: 0.6717   3rd Qu.: 0.7126  
##  Max.   : 2.6646   Max.   : 1.6632   Max.   : 1.4218   Max.   : 2.4730  
##       GNI             Mat.Mor          Ado.Birth          Parli.F       
##  Min.   :-0.9193   Min.   :-0.6992   Min.   :-1.1325   Min.   :-1.8203  
##  1st Qu.:-0.7243   1st Qu.:-0.6496   1st Qu.:-0.8394   1st Qu.:-0.7409  
##  Median :-0.3013   Median :-0.4726   Median :-0.3298   Median :-0.1403  
##  Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000   Mean   : 0.0000  
##  3rd Qu.: 0.3712   3rd Qu.: 0.1932   3rd Qu.: 0.6030   3rd Qu.: 0.6127  
##  Max.   : 5.6890   Max.   : 4.4899   Max.   : 3.8344   Max.   : 3.1850
# perform principal component analysis (with the SVD method)
pca_human <- prcomp(human_std)

# rounded percentages of variance captured by each PC
pca_pr <- round(summary(pca_human)$importance[2, ], digits = 5) * 100

# print out the percentages of variance
pca_pr
##    PC1    PC2    PC3    PC4    PC5    PC6    PC7    PC8 
## 53.605 16.237  9.571  7.583  5.477  3.595  2.634  1.298
# create object pc_lab to be used as axis labels
pc_lab <- paste0(names(pca_pr), " (", pca_pr, "%)")

# draw a biplot of the principal component representation and the original variables
biplot(pca_human, choices = 1:2, cex = c(0.4, 0.4), col = c("grey60","navyblue"), xlab = pc_lab[1], ylab = pc_lab[2])

Multiple Correspondence Analysis

Next we perform MCA on the tea dataset. There are 300 observations and 36 variables. Let's cut the variables down to a more manageable number. We include the type of tea (black, green, etc.), additions (lemon, milk, etc.), the type of tea package (bag, loose, etc.), price (cheap, upscale, etc.), with or without sugar, where the tea is bought (chain store, tea shop, etc.), and whether or not it is drunk with breakfast.

library(FactoMineR)
library(ggplot2)
library(tidyr)

data(tea)
dim(tea)
## [1] 300  36
str(tea)
## 'data.frame':    300 obs. of  36 variables:
##  $ breakfast       : Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
##  $ tea.time        : Factor w/ 2 levels "Not.tea time",..: 1 1 2 1 1 1 2 2 2 1 ...
##  $ evening         : Factor w/ 2 levels "evening","Not.evening": 2 2 1 2 1 2 2 1 2 1 ...
##  $ lunch           : Factor w/ 2 levels "lunch","Not.lunch": 2 2 2 2 2 2 2 2 2 2 ...
##  $ dinner          : Factor w/ 2 levels "dinner","Not.dinner": 2 2 1 1 2 1 2 2 2 2 ...
##  $ always          : Factor w/ 2 levels "always","Not.always": 2 2 2 2 1 2 2 2 2 2 ...
##  $ home            : Factor w/ 2 levels "home","Not.home": 1 1 1 1 1 1 1 1 1 1 ...
##  $ work            : Factor w/ 2 levels "Not.work","work": 1 1 2 1 1 1 1 1 1 1 ...
##  $ tearoom         : Factor w/ 2 levels "Not.tearoom",..: 1 1 1 1 1 1 1 1 1 2 ...
##  $ friends         : Factor w/ 2 levels "friends","Not.friends": 2 2 1 2 2 2 1 2 2 2 ...
##  $ resto           : Factor w/ 2 levels "Not.resto","resto": 1 1 2 1 1 1 1 1 1 1 ...
##  $ pub             : Factor w/ 2 levels "Not.pub","pub": 1 1 1 1 1 1 1 1 1 1 ...
##  $ Tea             : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How             : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ sugar           : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ how             : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ where           : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price           : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ age             : int  39 45 47 23 48 21 37 36 40 37 ...
##  $ sex             : Factor w/ 2 levels "F","M": 2 1 1 2 2 2 2 1 2 2 ...
##  $ SPC             : Factor w/ 7 levels "employee","middle",..: 2 2 4 6 1 6 5 2 5 5 ...
##  $ Sport           : Factor w/ 2 levels "Not.sportsman",..: 2 2 2 1 2 2 2 2 2 1 ...
##  $ age_Q           : Factor w/ 5 levels "15-24","25-34",..: 3 4 4 1 4 1 3 3 3 3 ...
##  $ frequency       : Factor w/ 4 levels "1/day","1 to 2/week",..: 1 1 3 1 3 1 4 2 3 3 ...
##  $ escape.exoticism: Factor w/ 2 levels "escape-exoticism",..: 2 1 2 1 1 2 2 2 2 2 ...
##  $ spirituality    : Factor w/ 2 levels "Not.spirituality",..: 1 1 1 2 2 1 1 1 1 1 ...
##  $ healthy         : Factor w/ 2 levels "healthy","Not.healthy": 1 1 1 1 2 1 1 1 2 1 ...
##  $ diuretic        : Factor w/ 2 levels "diuretic","Not.diuretic": 2 1 1 2 1 2 2 2 2 1 ...
##  $ friendliness    : Factor w/ 2 levels "friendliness",..: 2 2 1 2 1 2 2 1 2 1 ...
##  $ iron.absorption : Factor w/ 2 levels "iron absorption",..: 2 2 2 2 2 2 2 2 2 2 ...
##  $ feminine        : Factor w/ 2 levels "feminine","Not.feminine": 2 2 2 2 2 2 2 1 2 2 ...
##  $ sophisticated   : Factor w/ 2 levels "Not.sophisticated",..: 1 1 1 2 1 1 1 2 2 1 ...
##  $ slimming        : Factor w/ 2 levels "No.slimming",..: 1 1 1 1 1 1 1 1 1 1 ...
##  $ exciting        : Factor w/ 2 levels "exciting","No.exciting": 2 1 2 2 2 2 2 2 2 2 ...
##  $ relaxing        : Factor w/ 2 levels "No.relaxing",..: 1 1 2 2 2 2 2 2 2 2 ...
##  $ effect.on.health: Factor w/ 2 levels "effect on health",..: 2 2 2 2 2 2 2 2 2 2 ...
#summary(tea)

# column names to keep in the dataset
# select the columns to keep to create a new dataset
library(dplyr)
tea_time <- select(tea,one_of(c("Tea","How","how","price","sugar","where","breakfast")))


# look at the summaries and structure of the data
#summary(tea_time)
str(tea_time)
## 'data.frame':    300 obs. of  7 variables:
##  $ Tea      : Factor w/ 3 levels "black","Earl Grey",..: 1 1 2 2 2 2 2 1 2 1 ...
##  $ How      : Factor w/ 4 levels "alone","lemon",..: 1 3 1 1 1 1 1 3 3 1 ...
##  $ how      : Factor w/ 3 levels "tea bag","tea bag+unpackaged",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ price    : Factor w/ 6 levels "p_branded","p_cheap",..: 4 6 6 6 6 3 6 6 5 5 ...
##  $ sugar    : Factor w/ 2 levels "No.sugar","sugar": 2 1 1 2 1 1 1 1 1 1 ...
##  $ where    : Factor w/ 3 levels "chain store",..: 1 1 1 1 1 1 1 1 2 2 ...
##  $ breakfast: Factor w/ 2 levels "breakfast","Not.breakfast": 1 1 2 2 1 2 1 2 1 1 ...
#tea_time
#how, price, where
library(plyr)
## -------------------------------------------------------------------------
## You have loaded plyr after dplyr - this is likely to cause problems.
## If you need functions from both plyr and dplyr, please load plyr first, then dplyr:
## library(plyr); library(dplyr)
## -------------------------------------------------------------------------
## 
## Attaching package: 'plyr'
## The following objects are masked from 'package:dplyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
#levels(tea_time$how)
tea_time$how <- mapvalues(tea_time$how, from = c("tea bag", "tea bag+unpackaged", "unpackaged"), to = c("bag", "both", "loose")) 
#levels(tea_time$how)

#levels(tea_time$price)
tea_time$price <- mapvalues(tea_time$price, from = c("p_branded","p_cheap","p_private label","p_unknown","p_upscale","p_variable"),to = c("branded", "cheap", "private","unknown","upscale","variable")) 
#levels(tea_time$price)

#levels(tea_time$where)
tea_time$where <- mapvalues(tea_time$where, from = c("chain store","chain store+tea shop","tea shop" ),to = c("store", "both", "shop"))
#levels(tea_time$where)

# visualize the dataset
gather(tea_time) %>% ggplot(aes(value)) + facet_wrap("key", scales = "free") + geom_bar() + theme(axis.text.x = element_text(angle = 25, hjust = 1, size = 8))
## Warning: attributes are not identical across measure variables; they will
## be dropped

# tea_time is available

# multiple correspondence analysis
mca <- MCA(tea_time, graph = FALSE)

# summary of the model
summary(mca)
## 
## Call:
## MCA(X = tea_time, graph = FALSE) 
## 
## 
## Eigenvalues
##                        Dim.1   Dim.2   Dim.3   Dim.4   Dim.5   Dim.6
## Variance               0.304   0.249   0.188   0.183   0.153   0.148
## % of var.             13.299  10.893   8.218   8.018   6.691   6.483
## Cumulative % of var.  13.299  24.192  32.410  40.428  47.119  53.603
##                        Dim.7   Dim.8   Dim.9  Dim.10  Dim.11  Dim.12
## Variance               0.147   0.142   0.132   0.125   0.115   0.103
## % of var.              6.439   6.197   5.781   5.477   5.026   4.521
## Cumulative % of var.  60.042  66.239  72.020  77.497  82.524  87.045
##                       Dim.13  Dim.14  Dim.15  Dim.16
## Variance               0.095   0.080   0.073   0.048
## % of var.              4.174   3.479   3.210   2.091
## Cumulative % of var.  91.219  94.698  97.909 100.000
## 
## Individuals (the 10 first)
##              Dim.1    ctr   cos2    Dim.2    ctr   cos2    Dim.3    ctr
## 1         | -0.471  0.244  0.050 |  0.273  0.100  0.017 |  0.797  1.127
## 2         | -0.210  0.049  0.026 | -0.233  0.073  0.032 |  0.930  1.534
## 3         | -0.281  0.087  0.093 |  0.043  0.002  0.002 | -0.235  0.098
## 4         | -0.396  0.172  0.180 |  0.099  0.013  0.011 | -0.459  0.375
## 5         | -0.305  0.102  0.106 | -0.137  0.025  0.021 |  0.027  0.001
## 6         | -0.469  0.241  0.088 |  0.413  0.228  0.068 | -0.056  0.006
## 7         | -0.305  0.102  0.106 | -0.137  0.025  0.021 |  0.027  0.001
## 8         | -0.187  0.038  0.021 | -0.053  0.004  0.002 |  0.667  0.790
## 9         |  0.545  0.326  0.130 | -0.729  0.711  0.232 |  0.308  0.169
## 10        |  0.792  0.688  0.287 | -0.605  0.490  0.167 |  0.474  0.399
##             cos2  
## 1          0.143 |
## 2          0.511 |
## 3          0.065 |
## 4          0.243 |
## 5          0.001 |
## 6          0.001 |
## 7          0.001 |
## 8          0.267 |
## 9          0.042 |
## 10         0.103 |
## 
## Categories (the 10 first)
##               Dim.1     ctr    cos2  v.test     Dim.2     ctr    cos2
## black     |   0.436   2.200   0.062   4.310 |  -0.084   0.099   0.002
## Earl Grey |  -0.223   1.510   0.090  -5.189 |  -0.132   0.643   0.031
## green     |   0.330   0.563   0.013   2.007 |   0.960   5.815   0.114
## alone     |  -0.010   0.003   0.000  -0.240 |   0.182   1.238   0.062
## lemon     |   0.530   1.452   0.035   3.222 |  -0.197   0.245   0.005
## milk      |  -0.305   0.920   0.025  -2.721 |  -0.202   0.490   0.011
## other     |   0.415   0.242   0.005   1.261 |  -1.814   5.665   0.102
## how_bag   |  -0.581   8.987   0.441 -11.487 |   0.302   2.964   0.119
## how_both  |   0.363   1.940   0.060   4.239 |  -0.946  16.078   0.408
## how_loose |   1.796  18.183   0.440  11.466 |   1.043   7.495   0.148
##            v.test     Dim.3     ctr    cos2  v.test  
## black      -0.829 |   1.224  28.116   0.491  12.113 |
## Earl Grey  -3.065 |  -0.397   7.702   0.284  -9.214 |
## green       5.835 |  -0.425   1.510   0.022  -2.582 |
## alone       4.293 |  -0.225   2.496   0.094  -5.295 |
## lemon      -1.197 |  -0.791   5.229   0.077  -4.806 |
## milk       -1.797 |   0.892  12.721   0.212   7.956 |
## other      -5.517 |   1.520   5.274   0.071   4.623 |
## how_bag     5.971 |   0.149   0.950   0.029   2.937 |
## how_both  -11.046 |  -0.314   2.354   0.045  -3.671 |
## how_loose   6.662 |   0.119   0.130   0.002   0.763 |
## 
## Categorical variables (eta2)
##             Dim.1 Dim.2 Dim.3  
## Tea       | 0.091 0.114 0.491 |
## How       | 0.056 0.133 0.338 |
## how       | 0.619 0.463 0.045 |
## price     | 0.611 0.304 0.146 |
## sugar     | 0.049 0.010 0.116 |
## where     | 0.699 0.622 0.021 |
## breakfast | 0.002 0.098 0.158 |
# visualize MCA
plot(mca, invisible=c("ind"), habillage = "quali")

First we plot the frequencies of the variables. As we can see, most people drink Earl Grey from tea bags, take it plain ("alone", i.e. without additions), buy it from a chain store across all kinds of price ranges, and only sometimes add sugar. The second plot illustrates the MCA model. We can see that the exclusive group (upscale green tea, drunk loose, bought from a tea shop) clusters tightly together. A second group standing apart from the main body is the casual drinkers, who buy from both shops and stores across a variable price range, as bags or loose tea, and add a secret ingredient ("other"; possibly alcohol, I'm thinking of a rum toddy here!). I excluded the other MCA graphs as they added little or no further information to the analysis.